NON-LINEAR MODELS:

Basis Functions and Generalized Additive Models

Marco R. Steenbergen

University of Zurich

Theory

Motivation

  • Mid-Atlantic wage data from ISLR.

  • There appears to be a curvilinear relationship.

  • If we miss this, we may under-fit the data, generating bias.

  • But how do we build non-linear relationships into a “linear” regression model?

  • We use basis functions in the pre-processing stage.

What Is a Basis Function?

Basis Functions as Transformations

Consider the predictive feature \(x\). We transform this feature using the basis function \(B_j(x)\), which is specified ahead of the learning process. This function can be used to introduce curvature into a model.

A basis set is a collection of basis functions, each applied to \(x\), whose weighted sum approximates a complex function:

\[ y_i = \beta + \omega_1 B_1(x_i) + \omega_2 B_2(x_i) + \cdots + \omega_K B_K(x_i) + \varepsilon_i \qquad(1)\]
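Equation (1) can be sketched directly in R: expand \(x\) into basis columns, then fit a linear model on the expanded design. This is a minimal illustration with simulated data and hand-picked basis functions \(B_1(x) = x\) and \(B_2(x) = x^2\), both fixed before fitting.

```r
# Minimal sketch of Equation (1) on simulated data.
set.seed(1)
x <- runif(200, 0, 10)
y <- 2 + 0.5 * x - 0.05 * x^2 + rnorm(200, sd = 0.3)

B <- cbind(B1 = x, B2 = x^2)   # basis expansion of x, chosen before learning
fit <- lm(y ~ B)               # estimates beta plus weights omega_1, omega_2
coef(fit)
```

The model remains linear in the weights; only the features are non-linear in \(x\).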

Polynomial Regression

Basis Function

For an integer \(j\),

\[ B_j(x) = x^j \qquad(2)\]

The resulting regression model takes the form of

\[ y_i = \beta + \sum_{j=1}^K \omega_j x_i^j + \varepsilon_i \]

where \(K\) is set by the researcher.

Variants of Polynomials

  • Raw polynomials take the form \(x^j\); they simply transform the raw feature scores.

  • Because the columns \(x, x^2, x^3, \dots\) are highly correlated, raw polynomials can induce multicollinearity; orthogonal polynomials are therefore preferred.
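The contrast between the two variants is easy to verify: `poly()` in base R returns either raw or orthogonal polynomial columns spanning the same space. A minimal check on a simulated, age-like feature:

```r
# Raw polynomial columns are highly correlated; the orthogonal basis
# from poly() is not. Simulated x for illustration.
set.seed(1)
x <- runif(500, 18, 80)

raw  <- poly(x, degree = 3, raw = TRUE)
orth <- poly(x, degree = 3)

round(cor(raw), 2)    # large off-diagonal correlations
round(cor(orth), 2)   # essentially the identity matrix
```

Both bases yield identical fitted values; only the conditioning of the design matrix differs.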

Piece-Wise Polynomial Regressions

  • The polynomial is applied to the whole age range.

  • What if we believe that different polynomials apply to different ranges?

  • For instance, we divide age into two regions such as \(\text{age} < 50\) and \(\text{age} \geq 50\).

  • We can now apply two different polynomials, for instance,

    \[ \begin{split} y_i|\text{age} < 50 &\approx \beta_1 + \omega_{11} \text{age}_i + \omega_{12} \text{age}_i^2 \\ y_i|\text{age} \geq 50 &\approx \beta_2 + \omega_{21} \text{age}_i + \omega_{22} \text{age}_i^2 \end{split} \]

  • The problem is that this setup creates a jump in the fitted values at age = 50.

  • Moreover, the derivatives of the fit are also discontinuous at this cutoff.
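The discontinuity is visible if we fit the two quadratics separately, as sketched below on simulated data (the true curve and cutoff are illustrative assumptions):

```r
# Piece-wise quadratics fit separately on the two age regions.
set.seed(1)
age  <- runif(400, 18, 80)
wage <- 40 + 2 * age - 0.02 * age^2 + rnorm(400, sd = 10)

lo <- lm(wage ~ age + I(age^2), subset = age < 50)
hi <- lm(wage ~ age + I(age^2), subset = age >= 50)

# The two fits need not agree at the cutoff: a jump at age = 50.
predict(lo, data.frame(age = 50))
predict(hi, data.frame(age = 50))
```

Splines, discussed next, remove this jump by constraining the pieces to join smoothly.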

Polynomial Splines

  • Imagine we impose \(k\) cutoff points or knots that divide a predictive feature into \(k+1\) regions.

  • A polynomial spline or \(B\)-spline of degree \(d\) is a piece-wise polynomial of degree \(d\), such that at each knot the function and its derivatives up to order \(d-1\) are continuous.

Degrees of Freedom in B-Splines

  • Instead of tuning \(d\) and \(k\) separately, often we tune the degrees-of-freedom.

  • For \(B\)-splines, \(df = d + k\).

  • The default is typically the cubic spline, meaning that \(df = 3 + k\).

  • We typically set \(df\) through re-sampling.

A Problem with B-Splines

  • Cubic splines tend to show a lot of variance at the ends of the data range.

  • This is due to a lack of constraints at the end points.

  • We can reduce this variance through the use of natural splines.

  • Here the functional form becomes linear below the smallest knot and above the largest knot.
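That linearity in the tails can be verified numerically with `ns()`: below the smallest knot the second differences of the predictions vanish, so the extrapolation is a straight line. A minimal sketch with simulated data:

```r
# A natural spline is linear beyond the boundary knots.
library(splines)
set.seed(1)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.2)

fit  <- lm(y ~ ns(x, df = 4))
grid <- seq(-2, 0, by = 0.1)            # region below the smallest knot
p    <- predict(fit, data.frame(x = grid))

max(abs(diff(diff(p))))                 # ~0: the extrapolation is linear
```

A \(B\)-spline of the same \(df\) would instead extrapolate with a cubic, which is where the boundary variance comes from.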

Generalized Additive Models

  • We can combine various functions for different predictive features into an additive model:

    \[ y_i \approx \beta + \sum_{j=1}^P f_j(x_{ij}) \] where \(f_j(\cdot)\) is a spline with its own set of weights.

  • This is known as a generalized additive model or GAM (Hastie & Tibshirani, 1987).

Practice

Example 1: Polynomial Regression Without Tuning

The objective is to fit a polynomial regression of degree 3 to the age predictor in a wage model.

library(ISLR)
library(tidymodels)
tidymodels_prefer()
data(Wage)
set.seed(1671)
wage_split <- initial_split(Wage, prop = 0.75, strata = wage)
training_set <- training(wage_split)
test_set <- testing(wage_split)
# Standardization is not necessary but often useful
model_recipe <- recipe(wage ~ age, data = training_set) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_poly(age, degree = 3)
poly_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")
model_flow <- workflow() %>%
  add_model(poly_spec) %>%
  add_recipe(model_recipe)
wage_fit <- fit(model_flow, data = training_set)
tidy(wage_fit)
# A tibble: 4 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     112.     0.849    131.   0       
2 age_poly_1      403.    40.3       10.0  4.43e-23
3 age_poly_2     -410.    40.3      -10.2  8.51e-24
4 age_poly_3      111.    40.3        2.76 5.74e- 3
wage_fit %>%
  predict(test_set) %>%
  bind_cols(test_set) %>%
  metrics(truth = wage, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard     38.9   
2 rsq     standard      0.0820
3 mae     standard     27.5   

Example 2: Natural Splines with Tuning

  • Let us train a linear model of wage with age, education, and year of data collection as predictive features.

  • We create dummy variables for education.

  • We use natural splines for age and year.

  • We tune the degrees of freedom for those splines.

wage_recipe <- recipe(wage ~ age + education + year,
                      data = training_set) %>%
  step_normalize(c(age,year)) %>%
  step_dummy(education) %>%
  step_ns(age, deg_free = tune("age")) %>%
  step_ns(year, deg_free = tune("year"))
extract_parameter_set_dials(wage_recipe)
Collection of 2 parameters for tuning

 identifier     type    object
        age deg_free nparam[+]
       year deg_free nparam[+]
set.seed(20)
cv_folds <- vfold_cv(training_set, v = 10, repeats = 3)
# Full factorial
ns_grid <- crossing(
  "age" = 1:5,
  "year" = 1:5
)
ns_grid
# A tibble: 25 × 2
     age  year
   <int> <int>
 1     1     1
 2     1     2
 3     1     3
 4     1     4
 5     1     5
 6     2     1
 7     2     2
 8     2     3
 9     2     4
10     2     5
# ℹ 15 more rows
wage_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")
first_flow <- workflow() %>%
  add_model(wage_spec) %>%
  add_recipe(wage_recipe)
set.seed(1671)
ns_tune <- first_flow %>%
  tune_grid(cv_folds, grid = ns_grid, metrics = metric_set(rsq))
select_best(ns_tune, metric = "rsq")
# A tibble: 1 × 3
    age  year .config              
  <int> <int> <chr>                
1     5     1 Preprocessor21_Model1
final_flow <- first_flow %>%
  finalize_workflow(select_best(ns_tune, metric = "rsq"))
final_est <- final_flow %>%
  fit(training_set)
tidy(final_est)
# A tibble: 11 × 5
   term                          estimate std.error statistic  p.value
   <chr>                            <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                       45.8      5.33      8.58 1.71e-17
 2 education_X2..HS.Grad             11.1      2.86      3.90 9.91e- 5
 3 education_X3..Some.College        24.4      3.01      8.09 9.49e-16
 4 education_X4..College.Grad        37.2      3.00     12.4  3.25e-34
 5 education_X5..Advanced.Degree     62.4      3.24     19.2  2.10e-76
 6 age_ns_1                          47.2      4.85      9.74 5.79e-22
 7 age_ns_2                          38.2      5.88      6.50 1.00e-10
 8 age_ns_3                          34.6      5.11      6.77 1.61e-11
 9 age_ns_4                          51.0     12.2       4.18 3.03e- 5
10 age_ns_5                          10.9      9.57      1.14 2.55e- 1
11 year_ns_1                         12.9      2.78      4.63 3.96e- 6

Example 2: Effect Plot

library(gam)
gam_fit <- gam(wage ~ education + ns(age, 5) + ns(year, 1),
               data = training_set)
par(mfrow = c(1,3))
plot(gam_fit, se = FALSE, col = "#386cb0")

References

Hastie, T., & Tibshirani, R. (1987). Generalized Additive Models: Some Applications. Journal of the American Statistical Association, 82(398), 371–386. DOI: 10.2307/2289439